Abstract

Good accessibility of publicly funded research data is essential to secure an open scientific system and is
eventually becoming mandatory [Wellcome Trust will Penalise Scientists Who Don't Embrace Open Access. The Guardian 2012].
With the use of high-throughput methods in many research areas from physics to systems biology, large data
collections are increasingly important as raw material for research. Here, we present strategies worked out by
international and national institutions targeting open access to publicly funded research data via incentives or
obligations to share data. Funding organizations such as the British Wellcome Trust have therefore developed data
sharing policies and request commitment to data management and sharing in grant applications. Increased citation
rates are a strong argument for sharing publication data. Pre-publication sharing might be rewarded by a data
citation credit system via digital object identifiers (DOIs), which have initially been in use for data objects.
Besides policies and incentives, good practice in data management is indispensable. However, appropriate systems
for data management of large-scale projects, for example in systems biology, are hard to find. Here, we give an
overview of a selection of open-source data management systems proven to have been employed successfully in
large-scale projects.
INTRODUCTION

Dissemination of scientific data and knowledge cata- 
lyzes worldwide scientific progress. Enormous add- 
itional knowledge and insights could be extracted 
from existing projects if their data were publicly 
available. But often, these data are inaccessible for 
future research projects or data are stored in propri- 
etary data formats that are not interoperable standard 
formats. Therefore, governments, funding agencies 
and the Organization for Economic Cooperation 
and Development (OECD) commissioned studies 
concerning the best practice to handle publicly 
funded research data. First, the US National Research
Council requested full and open access to publicly
funded research data [1]. The OECD confirmed this
demand in its guidelines, arguing that open access to
data enables 'testing of new or alternative hypotheses 
and methods of analysis' and 'exploration of topics 
not envisioned by the initial investigator' [2]. The 
studies are adopted by funding organizations so that 
in an increasing number of grant application calls 
researchers are asked to commit to an open access 
data sharing policy [3]. However, data sharing and 
good practice in data management require funding of 
an efficient infrastructure including training of 
researchers [4]. 

In smaller projects, data management is often rea- 
lized via a wiki or similar content management sys- 
tems — systems managing arbitrary contents of web 
pages. For large-scale projects, more professional 
solutions are needed. We define the requirements for
a data management system in the area of systems
biology and systems medicine, respectively, and
give an overview of data management solutions
which have been successfully employed in large-scale
projects. Although there has been one short publication
comparing the data management system
MIMAS to other systems [5], this is the first comprehensive
review of data management systems
successfully employed in large-scale biology projects,
surveyed in the context of multinational
data sharing strategies.

STRATEGIES FOR PUBLICLY 
FUNDED RESEARCH DATA 

To achieve the maximal possible gain from the invested
public funds, strategies are needed which
ensure that the data resulting from the funded projects
become publicly available and benefit the systems
biology community. Below we describe the organizational
and technical issues involved in reaching
this goal.

Policies 

The OECD study [3] was initiated by science and 
technology ministers and has been recommended for 
adoption by the executing entities. Thereafter, many 
funding organizations, journals and scientific institu- 
tions — e.g. Biotechnology and Biological Sciences 
Research Council [6], the Wellcome Trust and the 
Sanger Institute — have implemented data sharing 
policies. Field et al. [7] suggest agreeing on a single
data sharing policy consensus template emphasizing 
public and timely delivery of data in secure public 
databases with a long-term funding horizon. This 
suggestion appears reasonable because conflicting 
sharing policies would be a major obstacle for re- 
search in international consortia. Ministries and 
funding agencies of other countries could adapt 
these policies as a template. 

For joint European research projects, harmonization
is already recommended in the report of the
EU-initiated and funded project 'PARSE.Insight'
[8]: 'An integrated and international approach is
desired, in which policies are geared to one another
to ensure efficiency and rapid development of
policies'. A full discussion of the large variety of
countries' data sharing policies and infrastructures
would go beyond the scope of this review. Here,
Joint Information Systems Committee (JISC) reports
provide more detail [9]. A straightforward approach
which is already practised in the UK, Canada and the USA
(and recommended to funding partners in other
countries, e.g. Germany) is to make data sharing
mandatory in order to receive (full) funding. In
Canada, there is consensus 'that all funded research be
made openly available for future use and ensure
this is a condition attached to future funding decisions'
by 2016 at the latest [10].

Data management plans, a required prerequisite
for funding in Australia [9, 11], can then be employed
to check the delivery of the planned data assets. Even if
sharing policies are in place, the practices to
control data availability vary, as the US Government
Accountability Office (GAO) [12] reports for the
field of climate research. The GAO report also
gives an example of grant payments being withheld
because of reluctance to share data with other researchers
[12].

Restrictive use of intellectual property (IP) rights 
and privacy are identified as main hurdles for open 
access to public research data. Shublaq et al. [13] 
discuss privacy issues arising from the expected 
higher frequency of sequencing whole genomes of 
individuals due to the continuously decreasing costs 
of the technology. An International Council for 
Science report detects additional impediments in 
'trends toward the appropriation of data, such as 
genetic information and the protection of databases' 
[14]. A report of the Australian Prime Minister's 
Science, Engineering and Innovation Council 
provides an explanation for the culture of non- 
sharing as due to the pressure for commercialization 
and competition even between academic groups 
[15]. To counteract these tendencies, the report [14]
recommends incorporating open access to research data
into IP right (IPR) legislation. However, privacy
issues can mostly be addressed by anonymization
and careful handling of data in order to minimize
the risk of misuse or disclosure. IPRs do not necessarily
have to be in conflict with open access to data:
data producers can grant open access to their data
without relinquishing their authorship. Embargo
periods can be employed to withhold the release
of data for a defined period of time, for example to enable
the author to prepare a manuscript or file for a
patent. Nonetheless, these periods need to be
justified, as required by the Wellcome Trust [16].
Mechanisms to give credit to authors of data are an
additional incentive to overcome the reluctance to
share data.
Data sharing infrastructure

The OECD study already emphasized the need for 
development and maintenance of data management 
systems in order to build up and secure a stable data 
management infrastructure. The infrastructure and
also training programs for researchers have to be
funded with a long-term horizon in order
to preserve data and to make them available for possible
future investigations. Data centers are the basis of this
infrastructure, but networks of institutional repositories
are gaining importance. Data federation of distributed
international repositories, a reasonable complementary
approach to central national repositories, is not
easy to implement because it has to deal with different
data formats, different languages in essential
documents and jurisdictional hurdles like license
agreements. Ruusalepp [9] reports that 'a significant
portion of data sharing infrastructure funding is being 
allocated to developing technical solutions for data 
federation from different repositories in one research 
domain and across domains'. 



REQUIREMENTS FOR 
LARGE-SCALE SYSTEMS BIOLOGY
DATA MANAGEMENT 

The basic functionality of a data management system 
includes (i) data collection, (ii) integration and 
(iii) delivery. Data collection (i) maintains and 
provides storage guaranteeing data security. It 
might abstract from physical storage questions, e.g. 
via references in an assets catalogue. Superior to 
manual data collection are semi-automated 
approaches allowing batch import or even fully auto- 
mated approaches via harvesting where data from 
distributed repositories are automatically transferred 
into the system — a technique also employed to crawl 
metadata from self-archived publications for the 
open archives initiative [17]. Central concepts of 
data integration (ii) are metadata (systematic descrip- 
tions of data), standardization and annotations in 
order to make data comparable. With respect to
interoperability of different 'omics' data, the ideal
data management system has to comply with
state-of-the-art community standards in the field as
largely covered by the Minimum Information for
Biological and Biomedical Investigations (MIBBI)
checklists, e.g. Minimum Information About a
Microarray Experiment (MIAME) for microarray
transcriptomics and MIAPE for proteomics. 'Omics'
might also profit from other existing 'omics'
standards, e.g. lipidomics from metabolomics. Subramaniam
et al. [18] provide information about lipidomics
data management. 'Omics'-specific analysis
and visualization functionality are optional require- 
ments. As most large-scale projects are expected to 
use a large diversity of 'omics' data, systems should be 
flexible to incorporate as much 'omics' data as pos- 
sible via mechanisms to integrate the specific standard 
formats, e.g. the SysMO-SEEK Just Enough Results 
Model (JERM) templates (see 'SysMO-SEEK' sec- 
tion). However, specific requirements of projects 
restricted to dedicated 'omics' might be adequately 
met by systems tailored to that 'omics' data, e.g. 
BASE for microarray transcriptomics or XperimentR 
for a combination of transcriptomics, metabolomics 
and proteomics. Data delivery (iii) includes dissem- 
ination to a broad public — requiring researchers to 
grant access to their data. Mechanisms to make pri- 
vate data automatically publicly available after a dedi- 
cated 'embargo period' are under discussion in order 
to improve the accessibility to the scientific community.
Support for publications, data publications and
upload to public repositories is a requirement that facilitates
systems biologists' work. Extensibility to new
data types and functionality as well as intelligently 
designed interfaces for integration with other systems 
are indispensable for a system in a continuously 
changing environment. 
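
The embargo mechanism mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical `DataSet` record with a deposit date and an embargo length in days; none of these names come from a particular data management system, and a real implementation would also have to handle justified extensions and access rights.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DataSet:
    """Hypothetical record of a deposited data set (illustration only)."""
    name: str
    deposited: date
    embargo_days: int = 0  # assumption: 0 means public immediately

    def release_date(self) -> date:
        # The data set becomes public once the embargo period has elapsed.
        return self.deposited + timedelta(days=self.embargo_days)

    def is_public(self, today: date) -> bool:
        return today >= self.release_date()


def publicly_visible(datasets, today):
    """Return the names of all data sets whose embargo has expired."""
    return [d.name for d in datasets if d.is_public(today)]
```

A periodic job could call `publicly_visible` with the current date and publish every data set it returns, so that release does not depend on a manual step by the data producer.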

Further requirements are quality control and cur- 
ation of data collections guaranteeing quality and 
long-term usability of data [19]. This plays an 
important role in many databases, e.g. BioModels 
[20] where curated data sets provide a higher quality 
than non-curated data sets. 

Systems biology projects have to deal with two big 
challenges: high-throughput data and data diversity. 
High throughput in this context should be understood
as a large number of experiments performed in a
short time, mostly in a strongly parallelized and miniaturized
fashion. It is pushed further by state-of-the-art
next-generation sequencing techniques, which generate
huge amounts of raw data that have to be managed
in order to derive information from them [21].
One way to manage such data is cloud storage (and
computing) [22]. The second big challenge is the
diversity in large-scale systems biology projects comprising
data types from all phases of the systems
biology cycle (hypothesis, experiment, evaluation,
model). Thus, these projects typically comprise
Systems Biology Markup Language (SBML)
models as well as nuclear magnetic resonance data,
proteomics data, microarrays and next-generation
sequencing data generated in the experimental
phase, to name only the most prominent ones.
Another challenge emerges in large-scale systems 
biology projects like ERASysBio+ (85 research 
groups from 14 countries in 16 consortia) [23] in 
the context of the European Research Area or Sys- 
temsX in Switzerland (250 research groups in 62 
approved projects) [24] or at the Superfund Research 
Center at Oregon State University [25]: here a data 
management system has to integrate existing data 
management solutions in consortia or dedicated re- 
search groups. 

In addition to fulfilling these basic requirements, 
many data management systems include multifaceted 
features for data analysis, e.g. BASE [26] provides
tools for microarray data analysis via a plug-in architecture
and SysMO-DB integrates modeling via
JWS-Online [27]. Non-functional requirements like
reliability, scalability, performance and security obvi- 
ously have to be respected in any data management 
system as well as basic functional requirements, e.g. 
storage management. Table 1 summarizes the spe- 
cific functional requirements described earlier for 
an ideal data management system in large-scale sys- 
tems biology projects. 

Standardization, metadata and 
annotation 

Standard data formats and annotations are required to
make data comparable. The MIBBI [28] recommendations
represent community standards for bioinformatics
data in the form of checklists specifying the
minimal information needed to reproduce experiments.
They are well established in many areas,
e.g. MIAME [29] for microarrays, while other areas
such as novel sequencing techniques require further
elaboration. MIBBI is complemented by many other
standards, e.g. Proteomics Standards Initiative
Molecular Interactions (PSI-MI) [30] in the proteomics area,
Mass Spectrometry Markup Language [31] in the
mass spectrometry area, BioPAX [32] as exchange
format for pathway data, SBML [33] for the description
of systems biology models in XML, CellML [34],
which can similarly describe arbitrary
mathematical models although its focus is biology, and
CabosML [35] for the description of carbohydrate
structures. There are also efforts to standardize the
graphical notation in systems biology in SBGN [36]
and to find lightweight syntax solutions to exchange
observation data [37]. Here, we can only name the
standards with the highest relevance to systems biology
data management and refer to Brazma et al. [38]
for a more detailed discussion of standards in systems
biology. Many standard data formats are defined in
XML to adequately represent metadata, i.e. systematic
descriptions of data, which are indispensable for
comparing and reproducing experiments. Besides standard
data formats, 'ontologies' are frameworks to standardize the
knowledge representation in a domain. The Open
Biological and Biomedical Ontologies foundry [39]
provides a suite of open-source ontologies in the
biomedical area. Common examples in bioinformatics
are the formal characterization of genes in
the Gene Ontology [40] and of gene expression
experiments in Microarray Gene Expression Data Society
(MGED) ontologies [41].

Table 1: Functional requirements for data management systems in large-scale systems biology projects

No.  Requirement                                  Notes
1    Support for standard data formats            MIBBI, SBML, PSI-MI, mzML, etc.
2    Assistance in metadata annotation            E.g. suggesting predefined values from ontologies
3    Automation of data collection                E.g. via harvesting from distributed repositories
4    Support for modelling data                   Primarily SBML models, CellML models
5    Support for upload to public repositories    NCBI GEO, EBI ArrayExpress, BioModels, JWS-Online, etc.
6    Extension system                             To new data types and functionality (e.g. plug-ins)
7    Integration with heterogeneous DM systems    SW design, interfaces, web interfaces, servlets
a    Fine-grained access control                  Keep data private, share with dedicated users, groups, world
b    Embargo periods                              Retain data publishing until predefined time points
c    Support for publications, data publications  Providing data for supplementaries, data publications
d    Support for large data                       Depends on the technique, e.g. next-generation sequencing
e    Analysis and modeling functionality          Optional
f    Connectivity to relevant external resources  (Optional) via integration (data warehouses) or links

Annotated information is required for data integration
using matching identifiers, e.g. two microarray
experiments on different platforms can be
compared by mapping the vendor-specific probe ids
to ENSEMBL gene ids. However, annotations are
changing and use of synonyms is common. Here, the
connexin GJA1 might serve as an example which
was formerly often referred to as CX43. Furthermore,
many annotations are ambiguous, e.g. one
ENSEMBL gene id might map to multiple Illumina
probe ids. Thus, data management systems should
keep original identifiers and update annotations on
demand.
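
The identifier mapping described above can be sketched as follows. This is an illustration under stated assumptions only: the probe and gene identifiers are made-up placeholders, and the annotation tables are plain dictionaries rather than the output of any real annotation service. The original probe ids stay untouched as keys, so a newer annotation table can be swapped in on demand, and one gene may legitimately map to several probes.

```python
from collections import defaultdict


def build_gene_index(probe_to_gene):
    """Invert a probe -> gene annotation into gene -> list of probes.

    One gene id may map to several probe ids, so values are lists.
    """
    index = defaultdict(list)
    for probe, gene in probe_to_gene.items():
        index[gene].append(probe)
    return index


def match_platforms(annotation_a, annotation_b):
    """Pair up the probes of two platforms via their shared gene ids.

    The annotations are passed in separately from the measurements, so the
    original probe ids are kept and the mapping can be recomputed whenever
    an updated annotation becomes available.
    """
    genes_a = build_gene_index(annotation_a)
    genes_b = build_gene_index(annotation_b)
    shared = sorted(set(genes_a) & set(genes_b))
    return {g: (sorted(genes_a[g]), sorted(genes_b[g])) for g in shared}


# Placeholder identifiers, not real probe or gene ids.
platform_a = {"A_probe_1": "GENE_1", "A_probe_2": "GENE_2"}
platform_b = {"B_probe_7": "GENE_1", "B_probe_8": "GENE_1", "B_probe_9": "GENE_3"}
matches = match_platforms(platform_a, platform_b)
```

Here `matches` contains only `GENE_1`, with one probe on the first platform and two on the second, which is the ambiguous case mentioned in the text.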

Level of integration 

The integration of resources can be handled in manifold
ways. The extreme poles are simple hyperlinks
connecting resources versus full integration in a data
warehouse structure [42]. BioMart [43] is an example
of a data warehouse system. Castro et al. [44] and
Smedley et al. [45] assess several approaches for
integration of functional genomics data. Goble
et al. [46] provide a description of the different
levels of integration and locate mashups at the lightweight
integration extreme. Mashups [47] are aggregations
of multiple web services, e.g. an aggregation
of microarray data with interactions from the
'HIV-1, Human Protein Interaction Database' for
investigation of the human immunodeficiency virus
type 1 (HIV-1) [48]. Workflows [49, 50] can be regarded
as a dedicated type of mashup. These techniques
depend on the availability of web services, which
are rapidly spreading and can be retrieved from
collections like BioCatalogue [51]. Semantic web
technologies have gained importance in
recent years and have been employed in pilot projects,
e.g. LabKey and SysMO-SEEK. The semantic web uses
knowledge representation to achieve an improved
exploitation of web resources [52, 53]. Semantic
web concepts include the Resource Description Framework
(RDF) for simple descriptions of resources and
relationships between them and the Web Ontology Language
(OWL) for describing ontologies (see 'Standardization,
metadata and annotation' section). Applications of these concepts
in systems biology include the Systems Biology
Ontology [54], which adds semantics to models,
allowing, e.g. understandable names to be given to
reaction rate equations, and the tool RightField [55],
which employs ontologies to enhance annotation of
data. RDF emerges as an appropriate method to
improve metadata. An example for integration of
bioinformatics data from various databases via these
semantic web technologies is Bio2RDF [56].
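
To make the idea of describing resources and their relationships more concrete, the following sketch represents statements as subject, predicate, object triples and queries them by pattern. It is a plain-Python illustration of the triple model only: the `ex:` names are hypothetical placeholders, no RDF library or serialization format is used, and the data are invented for the example.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" identifiers are hypothetical and used for illustration only.
triples = {
    ("ex:dataset42", "ex:measuresGene", "ex:GJA1"),
    ("ex:dataset42", "ex:producedBy", "ex:labA"),
    ("ex:CX43", "ex:synonymOf", "ex:GJA1"),
    ("ex:dataset77", "ex:measuresGene", "ex:CX43"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def datasets_for_gene(store, gene):
    """Find data sets annotated with a gene or with one of its synonyms."""
    names = {gene} | {s for s, _, _ in match(store, p="ex:synonymOf", o=gene)}
    return sorted(
        s for name in names for s, _, _ in match(store, p="ex:measuresGene", o=name)
    )
```

With these example triples, `datasets_for_gene(triples, "ex:GJA1")` also finds the data set annotated with the older synonym, which is the kind of metadata improvement that explicit relationships make possible.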



User commitment 

The basic challenge is to convince the participants on
all levels (PI to technician) of the necessity to share
experimental data and to use data management
systems reliably. The sharing culture varies between
scientific disciplines. One exceptional
example of outstanding community spirit is the
human genome project's Bermuda principles with
the requirement to upload sequences to a public
database within 24 h [57], but usually the willingness
to share systems biology data is not that high
(see 'Policies' section). Swan et al. [58] studied the reasons
why researchers do not want to share and give a
detailed description that emphasizes lack of resources
and lack of expertise as the main reasons why data are
not shared easily. A further major reason for the
reluctance to share data before publication is that
researchers want to prevent competitors from anticipating
publications based upon their data. To strengthen
user commitment to share unpublished data,
the introduction of a reward system that gives citation
credit to unpublished data in public databases in the
form of digital object identifiers (DOIs) has been
proposed [59, 60]. DOIs are issued by the International
DOI Foundation and are already in use for
conventional citations. An interesting approach
toward data citation is the Signaling Gateway Molecule
Pages (SGMP), a database providing structured
data about signaling proteins [61]. SGMP
does not implement DOI services directly for data
objects but for review articles about signaling proteins.
DOIs have since been used for giving credit to
data producers, e.g. [62, 63]. With the goal of establishing
a DOI infrastructure for data, the
international 'DataCite' organization [64] was
founded in 2009. DataCite DOIs can be minted via
DataCite's MetadataStore application programming interface (API)
after the associated data center, institutional repository
or supplementary data archive has been registered
with the MetadataStore service. The DOI
minting procedure is best integrated into data
publishing software in order to keep references to
data up-to-date and would thus be useful functionality
for systems biology data management. A convincing
argument for sharing published data is the increased
citation rate as a consequence of sharing [65].
Another obstacle is the sometimes large effort of
converting data into a shareable format and the
investment required for long-term preservation. A
JISC report investigates the costs of preservation of
research data in detail [66]. The reluctance to share
might be counteracted by covering the entire data
management infrastructure with appropriate funding
and by intelligent software systems easing the
task of making data shareable. User interfaces on
smartphones are already employed for clinical data
management [67] and might help in making the
use of systems biology data management systems
more attractive and efficient.



SURVEY OF SYSTEMS BIOLOGY 
DATA MANAGEMENT SYSTEMS 

Data management systems provide the core func- 
tionality to collect data, integrate it with other data 
and disseminate it. There is much variation in how 
these tasks are accomplished, e.g. automation of data 
collection, fine-grained access management, support 
for standard formats and submission to public repositories.
For this review, we selected open-source
systems biology data management systems proven
to have been employed successfully in large-scale projects.

SysMO-SEEK 

SysMO-SEEK [68] was developed for a large-scale 
transnational research initiative on the systems 
biology of microorganisms (SysMO). Eleven pro- 
jects contributed to the initiative, most of them 
having their own data management solution (see 
Figure 1). Main components are an assets catalogue 
and yellow pages which provide social network 
functionality leveraging the exchange of expertise. 
The JERM determines metadata based on the
Investigation, Study, Assay (ISA) format [69] and
is compliant with MIBBI [28]. JERM enables useful
comparability through a minimal compromise between metadata
schemes that differ between projects. JERM templates
exist for the most common bioinformatics data types,
facilitating data exchange and data deposition in the
relevant public repositories, e.g. ArrayExpress. The
tool RightField eases the task of metadata generation
using ontology annotations in spreadsheets
[55]. SEEK can be easily integrated into a heterogeneous
landscape of data management systems because
its software architecture with web interfaces ensures
extensibility. Projects registered with SEEK
are not forced to change their existing data management
solutions. Harvesters can automatically collect
assets held at distributed project sites and return
them to the SEEK interface [68], where they are
interpreted by extractors. SEEK is connected to
many relevant external resources: modeling via
JWS-Online [27] is directly integrated and a plug-in
to PubMed enables linking of publications to
supporting data in SEEK. SEEK eases the
publication process by providing data for supplementaries.
Data management of models is further supported
by the capability to link models to data and,
vice versa, simulated data to models. Experimental
data can be compared to results from models via
combined plots. Besides the SysMO project, SEEK
is employed in several other large-scale systems biology
projects, e.g. ERASysBio+ and the Virtual Liver
network. SysMO-SEEK can be installed quite conveniently
via a virtual machine image.
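
The notion of a "minimal compromise" between differing metadata schemes can be illustrated with a small sketch. It is not the JERM definition: it simply assumes that each project's scheme is a set of field names and takes the fields shared by all projects as the common core. Project names and field names are hypothetical.

```python
def minimal_common_schema(project_schemas):
    """Return the metadata fields that every project scheme provides.

    project_schemas maps a project name to an iterable of field names.
    """
    field_sets = [set(fields) for fields in project_schemas.values()]
    if not field_sets:
        return set()
    return set.intersection(*field_sets)


def to_common_record(record, common_fields):
    """Reduce a project-specific record to the shared fields only."""
    return {key: record[key] for key in sorted(common_fields) if key in record}


# Hypothetical project schemes, for illustration only.
schemes = {
    "project_a": ["organism", "assay_type", "contact", "strain"],
    "project_b": ["organism", "assay_type", "contact", "medium"],
    "project_c": ["organism", "assay_type", "contact"],
}
common = minimal_common_schema(schemes)
```

Records reduced with `to_common_record` become comparable across projects, while each project keeps its richer local scheme unchanged.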

DIPSBC 

The Data Integration Platform for Systems Biology 
Cooperations (DIPSBC) [70] is based on the Solr/ 
Lucene search server and on a wiki system that are 
brought together via a Solr search plug-in into the 
wiki system. The central idea is to construct a system 
around a search engine providing efficient access to 
project data and other integrated data sets. Search 
results are interpreted depending on the dedicated 
data types. Solr requires XML and thus ensures a systematic
metadata model. The system is flexible: new
data types can be integrated straightforwardly by
writing the corresponding handlers. Most standard
data types already use XML. XML files have to be
indexed before they can be searched efficiently
using a syntax similar to most popular search engines.
DIPSBC has been successfully used in many systems
biology projects integrating diverse data types, e.g.
several 'omics' data types and computational models.
Although the main installation now contains about
35 million entries, response times are usually <1 s. XML
schema definitions exist for many data types, e.g. for
next-generation sequencing data. Figure 2 shows a
system chart of DIPSBC. Handlers for the import of
the MINiML format from the National Center for
Biotechnology Information Gene Expression Omnibus
(NCBI GEO) repository already exist and might
be easily rewritten for upload to public repositories.
The installation of DIPSBC is straightforward for a
bioinformatician and is well documented.
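
The conversion step that precedes indexing can be sketched as follows. The example builds a document in Solr's generic XML update format (an `add` element containing `doc` elements with named `field` children). It is not DIPSBC's own handler code; the record structure and field names are assumptions made for illustration, and posting the result to a Solr server is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Convert a list of dictionaries into a Solr XML <add> message.

    List or tuple values are written as repeated fields with the same name.
    """
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record; the field names are not taken from DIPSBC.
xml_message = to_solr_add_xml(
    [{"id": "exp-1", "type": "microarray", "keyword": ["liver", "time series"]}]
)
```

A handler for a new data type would then only need to map its native format onto such dictionaries before the generic conversion and indexing steps.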

Figure 1: SysMO-DB system chart: SysMO-DB was developed for a multi-national large-scale project consisting of
multiple 'sub'-projects with their own data management solutions, which were not changed. For that purpose the Just
Enough Results Model (JERM) was introduced, which aims at finding minimal information to make data comparable
across project borders. JERM templates cater for compliance with MIBBI. Data of multiple projects are brought
together via upload to the assets catalogue, which can be performed automatically using so-called JERM 'harvesters'.
The yellow pages component provides details about projects, participating people and institutions to enable
exchange of expertise and association of assets to people. Many external resources are connected to SysMO-DB,
e.g. integration of JWS-Online provides systems biological modelling facilities for project data.

openBIS

openBIS [71] is an open-source distributed data
management system for biological information developed
at the ETH Zurich. The central components
of openBIS are an application server and a data store
server. The application server stores metadata in a
dedicated database and interfaces to the user web
browsers. openBIS employs a generic metadata
model based on controlled vocabularies. The data
store server handles high-volume data, e.g.
next-generation sequencing data, in a separate database
and collects new data items via drop boxes (directories
monitored for incoming data) or via a remote
Java API using the secure HTTPS protocol. The
web interface provides visualization and retrieval
functionality. openBIS is designed to allow extensions
and integration of workflows with minimal
effort and provides interfaces for the integration of
relevant community tools. The workflow system
P-Grade [72] has been integrated for proteomics
data and it is planned to integrate the Galaxy workflow
system [50] for analysis of next-generation
sequencing data. openBIS is successfully employed
in the large-scale systems biology project SystemsX
(250 research groups in 62 approved projects) and
several EU projects. openBIS can be installed with
some IT expertise. A demo system is online and an
installation guide and comprehensive documentation
are available.
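
The drop box idea (a directory monitored for incoming data) can be sketched in a few lines. This is a generic illustration and not openBIS code: the function name, the directory layout and the decision to move files into a store directory are assumptions. A real system would additionally wait until a file is completely written and would register metadata for each item.

```python
import os
import shutil


def poll_dropbox(dropbox_dir, store_dir):
    """Move every regular file found in the drop box into the data store.

    Returns the names of the files registered in this polling round.
    """
    registered = []
    for name in sorted(os.listdir(dropbox_dir)):
        source = os.path.join(dropbox_dir, name)
        if not os.path.isfile(source):
            continue  # sub-directories are ignored in this sketch
        shutil.move(source, os.path.join(store_dir, name))
        registered.append(name)
    return registered
```

Calling `poll_dropbox` at regular intervals, for example from a scheduler, gives instruments and users a simple way to hand over data without a dedicated client.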

Figure 2: DIPSBC system chart. Data are first converted to the Solr XML format ('normalized') and afterward
indexed by the Solr search server. Data sets can then be found efficiently and are passed to document-type-specific
objects which initiate processes corresponding to the dedicated data type (here MIAME data and PubMed
data are shown). New data types can be introduced straightforwardly by adapting new objects derived from existing
ones, e.g. the XML object.

XperimentR

XperimentR [73], developed at Imperial College
London, is not distributed as software but instead
provides all its features via the web with the intention
to free the user from installation tasks. The system
integrates three specialized data management systems:
BASE for microarrays [26], OMERO [74] for microscopy
imaging data and Metabolomixed, based
on the open-source system omixed [75], for metabolomics
and proteomics data. XperimentR lets users
describe experimental metadata via a graphical
representation based on the ISA-Tab structure. As a
side-effect, data can be easily exported to the
ISA-Tab format. Annotation of experiments is assisted
by an integrated ontology lookup service.
XperimentR's web approach is only feasible for
smaller projects; for large-scale projects a distribution
of the software would be desirable. However, the
components BASE, OMERO and omixed are distributed
as stand-alone versions, but without
XperimentR functionality.

Gaggle and the Bioinformatics Resource 
Manager 

Gaggle [76] is an open-source Java software environment
developed at the Institute for Systems Biology. Gaggle
uses only four data types (lists of names, matrices,
networks and associative arrays) to integrate various data
resources and software. To integrate applications, the
Gaggle boss relays messages of these basic data types
between 'geese' (software adapted to Gaggle). Current
geese include a data matrix viewer, Cytoscape [77], the
TIGR microarray experiment viewer, an R/Bioconductor
goose, a simple bioinformatics web browser and the
Bioinformatics Resource Manager (BRM), which adds data
management functionality to Gaggle. Gaggle is easily
extensible to new applications via the message interface
based on the four data types. Gaggle is now employed in
the integration of databases over the Internet in the US
Department of Energy's effort to build up a systems
biology knowledgebase [78].
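
The relay pattern between the boss and the geese can be illustrated with a minimal sketch. It is not Gaggle's actual Java interface: the class names `Boss` and `Goose`, the method names and the string labels for the four data types are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical labels for the four basic data types named in the text.
DATA_TYPES = {"namelist", "matrix", "network", "associative_array"}


@dataclass
class Goose:
    """A tool adapted to the message interface (hypothetical stand-in)."""
    name: str
    received: List[tuple] = field(default_factory=list)

    def handle(self, sender: str, data_type: str, payload: Any) -> None:
        # A real goose would display or analyze the payload; here it is only recorded.
        self.received.append((sender, data_type, payload))


class Boss:
    """Relays messages of the four basic types between registered geese."""

    def __init__(self) -> None:
        self.geese: Dict[str, Goose] = {}

    def register(self, goose: Goose) -> None:
        self.geese[goose.name] = goose

    def broadcast(self, sender: str, data_type: str, payload: Any) -> int:
        """Send the payload to every goose except the sender; return the count."""
        if data_type not in DATA_TYPES:
            raise ValueError(f"unsupported data type: {data_type}")
        delivered = 0
        for name, goose in self.geese.items():
            if name != sender:  # do not echo the message back to its sender
                goose.handle(sender, data_type, payload)
                delivered += 1
        return delivered
```

Because every message is restricted to one of four simple types, a new tool only needs to implement a handler like `handle` to take part, which is the property the text credits for Gaggle's extensibility.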

The BRM [79] is a freely distributed data management
system connected to Gaggle. The system is designed as a
client-server architecture containing a PostgreSQL
database storing project data, external data and metadata,
a server providing access to the database and a Java
front-end to the server that lets the user manage projects
and analyze data. BRM integrates user data with functional
annotation and interaction data from public resources.
Heterogeneous data are integrated via overlapping column
values from the spreadsheet formats which are employed to
import high-throughput data. Built-in tools allow
converting data between multiple formats. BRM can export
data in simple file formats and to Cytoscape. BRM can be
installed straightforwardly with the help of a quick-start
guide.
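
Integration via overlapping column values amounts to joining tables on the columns they share. The following is a minimal sketch under the assumption that each imported spreadsheet is available as a list of dictionaries; the function name `join_on_shared_columns` and the example columns are hypothetical and do not reflect BRM's implementation.

```python
from typing import Dict, List

Row = Dict[str, str]


def join_on_shared_columns(left: List[Row], right: List[Row]) -> List[Row]:
    """Inner-join two tables on every column name they have in common."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    if not shared:
        raise ValueError("tables share no column to join on")

    # Index the right-hand table by its values in the shared columns.
    index: Dict[tuple, List[Row]] = {}
    for row in right:
        index.setdefault(tuple(row[c] for c in shared), []).append(row)

    # Combine each left-hand row with every right-hand row that has the same key.
    joined: List[Row] = []
    for row in left:
        for match in index.get(tuple(row[c] for c in shared), []):
            joined.append({**row, **match})
    return joined
```

For example, an expression table and an annotation table that both carry a `gene` column would be merged row by row on matching gene identifiers; rows without a partner are dropped in this inner-join variant.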

MIMAS 

MIMAS [5] is an open-source data management system for
multi-omics. Together with openBIS, MIMAS is a pioneer in
integrating next-generation sequencing techniques. The
system was originally conceived for microarrays in a
MIAME-compliant fashion but has been extended to
sequencing techniques, using MGED ontologies for
annotation in the absence of a broadly accepted standard
for sequencing data. MIMAS is implemented in Perl as a web
application for the Apache server and is portable between
MySQL and Oracle databases. MIMAS is used in about 50
academic laboratories in Switzerland and France. Upload of
data in MAGE-TAB format to the European Bioinformatics
Institute's (EBI) ArrayExpress repository is facilitated
by a one-step upload procedure. IT expertise is needed to
install MIMAS; a demo system is accessible via a guest
login.

ISA software suite 

The open-source ISA software suite [80] provides Java
applications for managing data in the ISA-Tab format.
ISA-Tab was specified in a backward-compatible fashion
based on MAGE-TAB [81], the data format employed at the
EBI repository ArrayExpress for depositing microarray
data. Thus, the produced formats can be directly deposited
at ArrayExpress. ISA is an acronym for Investigation,
Study and Assay, three metadata categories constituting a
hierarchical structure for the description of experiments
including associated measurements (assays) and the
experimental context. The BioInvestigation Index (BII)
Component Manager, part of the suite, administers ISA-Tab
metadata in a relational database. Other tools include the
ISAconfigurator, which enables customization of the editor
tool ISAcreator, which in turn can support metadata
conformance to the MIBBI checklists. Access to the ISA-Tab
data sets can be granted to dedicated users and
distinguished as public or private. The BII Manager
facilitates straightforward export of data to community
databases. The ISA software suite has been employed in
several projects [69] and integrated into other data
management systems, e.g. the Harvard stem cell discovery
engine [82]. Installation of the Java applications of the
software suite is simple.
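
The Investigation, Study and Assay hierarchy can be pictured as three nested record types. The sketch below only mirrors that nesting; the field names are illustrative assumptions and are not the column names defined by the ISA-Tab specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A measurement performed within a study (illustrative fields)."""
    measurement: str
    technology: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """The experimental context that groups related assays."""
    identifier: str
    description: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """Top-level container grouping one or more studies."""
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Walk the hierarchy and collect every file attached to an assay."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]
```

Keeping the experimental context at the study level and the measurements at the assay level is what allows one description to cover several technologies applied to the same samples.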

BASE 

BASE [26] was developed at Lund University for microarray
data management. The multi-user open-source system uses a
MySQL or PostgreSQL relational database, an integrated
File Transfer Protocol (FTP) server for batch data
transfer and a Tomcat servlet attached to a web server for
dissemination of data. A laboratory information management
system comes with BASE and thus enables complete capture
of all data concerning the experiment, as needed for
compliance with the MIAME standard. BASE supports the most
popular microarray platforms and provides data analysis
functionality, which can be extended by a plug-in system
that also allows straightforward integration of other
systems, as described in the 'XperimentR' section. BASE
has been employed in many projects, e.g. [83]. Plug-ins
provide import/export functionality for the MAGE-TAB
format for data deposit in EBI's ArrayExpress repository.
BASE can be installed with some IT expertise.

LabKey

LabKey Server is an open-source data management system
developed by LabKey Software with a focus on specimen
management [84]. The LabKey instance Atlas is applied in a
large-scale HIV project consisting of many consortia.
LabKey provides a web server via a Tomcat servlet which
accesses a PostgreSQL or Microsoft SQL Server database.
Access control is role-based and allows users to keep data
private or to share them with a wider community. Data are
integrated using semantic web methods like RDF, which
describes the connection of uniform resource identifiers
(URIs), in combination with SQL functionality. Annotations
ensure compliance of data descriptions with ontologies.
Data can be integrated via view summaries based on
cross-reference identifiers shared by multiple tables.
Experiments can be described via general templates for
assays or via specialized types which are provided for
many objects, including neutralizing antibody assays and
microarrays. Extensions to new data types are facilitated
by RDF's suitability for describing metadata. LabKey can
dynamically access external resources so that
modifications in the external data sets are immediately
visible.
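
RDF expresses connections between URIs as subject-predicate-object triples. The following minimal sketch models triples as plain tuples and answers simple pattern queries; it does not use LabKey's API, and all URIs and the function name `match` are invented for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

EX = "http://example.org/"  # placeholder namespace, invented for this sketch

TRIPLES: List[Triple] = [
    (EX + "assay/3", EX + "vocab#usesSpecimen", EX + "specimen/42"),
    (EX + "specimen/42", EX + "vocab#collectedFrom", EX + "participant/7"),
    (EX + "assay/4", EX + "vocab#usesSpecimen", EX + "specimen/43"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with every pattern element that is given."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]
```

Since a new kind of metadata only requires new predicates rather than a changed table layout, a triple-based description makes it comparatively easy to extend a system to new data types.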

Comparison of systems 

All systems compared here have proved useful in
large-scale projects. Figure 3 shows their performance
with respect to criteria corresponding to the requirements
from Chapter 3. Compliance with community standards in
different 'omics' areas is a central criterion, which is
largely covered by the MIBBI [28] and MIAME [29]
recommendations.

The metadata model is often related to the standard
formats used. SysMO-SEEK provides the most elaborate
approach for automated data collection: harvesters
automatically look for new data and feed them to the
system. Semi-automated approaches include the dropboxes of
openBIS and the batch import facilities in most other
systems.
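
The idea of a harvester that looks for new data can be sketched as a function that scans a directory and reports only files it has not seen before. This is an assumption-laden illustration, not SysMO-SEEK's harvester: the function name `harvest_new_files` and the use of a local directory as the data source are hypothetical.

```python
from pathlib import Path
from typing import List, Set


def harvest_new_files(directory: Path, seen: Set[Path]) -> List[Path]:
    """Return files in `directory` not yet in `seen`, and remember them."""
    new_files = sorted(
        p for p in directory.iterdir() if p.is_file() and p not in seen
    )
    # Record the files so that the next call reports only later arrivals.
    seen.update(new_files)
    return new_files
```

Calling such a function periodically and passing each returned file to an import routine would give fully automated collection, whereas a dropbox or batch import still relies on the user to place or select the files.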

Fine-grained access control is indispensable when multiple
groups and consortia are working with the same data
management system. Researchers can thus control whether
they keep their data private or with whom they share them.
Most systems allow granting read/write permissions to
users, groups and the general public. The functionality of
nearly all systems can be increased via extensions, which
can sometimes be taken from a pool of existing software.
In large-scale projects, heterogeneous data management
solutions often have to be brought together. SysMO-SEEK
was developed to be integrated with other data management
systems via a software architecture using web interfaces
for the adaptation of external systems. Systems like the
ISA software suite [69], BASE (e.g. with XperimentR),
DIPSBC or Gaggle (e.g. with BRM) have also been integrated
with other data management solutions. However, integrating
these or other systems into a landscape of heterogeneous
data management systems in a large-scale project will not
come without effort.
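
The permission model mentioned above, with read/write rights granted to users, groups and the general public, can be sketched as follows. The names `DataSet`, `is_allowed` and the `PUBLIC` marker are hypothetical, and none of the reviewed systems is claimed to implement access control in exactly this way.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set

PUBLIC = "*"  # hypothetical marker standing for the general public
PERMISSIONS = {"read", "write"}


@dataclass
class DataSet:
    owner: str
    # Maps a principal (user name, group name or PUBLIC) to its permissions.
    # This sketch assumes user and group names do not collide.
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, principal: str, permission: str) -> None:
        if permission not in PERMISSIONS:
            raise ValueError(f"unknown permission: {permission}")
        self.grants.setdefault(principal, set()).add(permission)


def is_allowed(data_set: DataSet, user: str,
               groups: Iterable[str], permission: str) -> bool:
    """Check a permission for a user who belongs to the given groups."""
    if user == data_set.owner:
        return True  # owners keep full control over their own data
    principals = {user, PUBLIC} | set(groups)
    return any(permission in data_set.grants.get(p, set()) for p in principals)
```

A data set without any grants stays private to its owner; granting `read` to a consortium group or to `PUBLIC` widens access step by step, which corresponds to the choice between keeping data private and sharing them.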

Support of upload to public repositories is invaluable for
facilitating data sharing. SysMO-SEEK supports upload to
repositories like JWS-Online, ArrayExpress and NCBI GEO.
The ISA software suite provides the framework for ISA-Tab,
which is also used by XperimentR and directly connected to
EBI ArrayExpress upload. BASE, DIPSBC, Gaggle and MIMAS
have implemented functionality for upload to NCBI GEO or
ArrayExpress. As the DOI infrastructure for data is still
being established, none of the systems compared here has
special support for data DOIs, e.g. using DataCite's API
for minting DOIs and for notifications of URL changes.
However, DOIs can be minted external to the data
management systems and refer to data objects inside them,
e.g. to persistent URIs as provided by the SEEK.

Modeling support distinguishes dedicated systems biology
data management systems from general-purpose
bioinformatics systems, which nevertheless might cover
most features needed for systems biology data management.
DIPSBC has the BioModels [20] database indexed by its
search engine, and the potential of Gaggle to integrate
modeling is demonstrated in preliminary versions of the
Systems biology knowledgebase Kbase framework [78].
SysMO-SEEK, however, provides the highest level of
modeling support among the systems compared: integration
of JWS-Online, functionality to link data and models, and
functionality to plot experimental and simulated data
coherently. Currently no system provides embargo periods,
but SysMO-SEEK provides services to publish data at the
end of projects. Large next-generation sequencing data are
currently managed in openBIS and MIMAS. SysMO-SEEK
provides an archiving solution for large data. LabKey
Server and Gaggle are currently offered for cloud
computing but might be further improved to exploit its
full potential, e.g. via MapReduce [85] pipelines.
However, all systems are scalable, and no great obstacles
are expected in moving them to the cloud. While most
systems access external resources via web services or
URLs, the BRM uses data warehousing to integrate them.
Additional analysis and visualization functionality for
different 'omics' can be a useful option and is adequately
realized by BASE for transcriptomics/microarrays and by
XperimentR, which bundles BASE, Metabolomixed and OMERO
for transcriptomics, metabolomics and proteomics.



CONCLUSIONS 

Initiatives to foster open accessibility of publicly
funded research data are now being put into practice
through the compilation of data sharing policies and
through mandatory professional data management strategies
in application calls. Often the willingness to share
biological data is limited, and it is difficult to
convince researchers to share raw data even after
publication. Approaches based on voluntary sharing intend
to reward researchers with publication-like citations for
publishing data or by providing attractive data management
systems supporting their work. Making data citable via
DOIs is promoted by the DataCite organization and has now
been used in a few initial cases. Beyond voluntary
approaches, making data management and sharing a condition
of funding, using data management plans and monitoring
compliance with them are efficient mechanisms to get
publicly funded data shared and are on the road map in
some countries. These actions will have to be accompanied
by appropriate funding for data management infrastructure,
including training of researchers and development of
appropriate systems. All data management systems reviewed
here have particular strengths arising from the dedicated
contexts in which they have been developed: DIPSBC, the
flexibility to easily add new data types and the large
collection of integrated types; ISA-Tab, the direct
connection to the EBI ArrayExpress repository and the good
metadata annotation facilities proved in many projects;
XperimentR, the good microscopy, metabolomics, microarray
and annotation facilities; Gaggle, the good extensibility
and the collection of interesting extensions already
existing; openBIS and MIMAS, the next-generation
sequencing integration; BASE, the good microarray
management and extension system proved in many projects;
and LabKey, the specimen management and semantic web
features. However, SysMO-SEEK has clear advantages as an
out-of-the-box solution for large-scale systems biology
projects because it can efficiently integrate a
heterogeneous landscape of multiple data management
systems, is MIBBI-compliant, provides the most elaborate
modeling support, fine-grained access control and
automatic data harvesting, assists metadata annotation and
supports relevant public repositories.

Figure 3: Benchmarking: data management systems for large-scale systems biology projects.

The systems reviewed here provide a useful foundation for
an efficient data management infrastructure but might be
further advanced by adding or optimizing these features:
automatic data collection, support for conversion to
standard formats, upload to public repositories,
preparation of publication supplements and data
publications, and application of techniques like data
warehouses and the semantic web for better exploitation of
inherent knowledge and synergies. They will then help make
sharing more attractive by easing researchers' work.